Intern s2 preview lite awq fix bug #4600

Open
43758726 wants to merge 11 commits into InternLM:main from 43758726:InternS2_preview_lite_awq_fix_bug

Conversation

@43758726
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

Copilot AI review requested due to automatic review settings May 19, 2026 15:22
Contributor

Copilot AI left a comment

Pull request overview

This PR updates LMDeploy Lite quantization/calibration paths to better support Qwen3.5 / InternS2Preview architectures and to improve AWQ usability (including a “data-free” mode), alongside a small VLM utility update and a batch-splitting fix.

Changes:

  • Add InternS2Preview/Qwen3.5 model build support in the VLM wrapper and fix batch splitting for Qwen3.5 position_embeddings.
  • Introduce lmdeploy.lite.model registry-based per-architecture helpers to drive skip patterns (and some MoE parameter rewrites), and propagate skip lists into quantization_config.
  • Refactor calibration loading to return the resolved HF architecture and add calib_samples=0 flow for data-free AWQ.
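
The registry idea in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the names `MODELS`, `skipped_modules`, and `modules_to_not_convert` come from this page, while the `Registry` class, the helper classes, and the concrete skip patterns (`lm_head`, `mlp.gate`) are assumptions made up for the example.

```python
from typing import Callable, Dict, List, Optional


class Registry:
    """Tiny name -> class registry (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, type] = {}

    def register(self, *names: str) -> Callable[[type], type]:
        """Return a decorator that registers a class under each given name."""
        def decorator(cls: type) -> type:
            for name in names:
                self._entries[name] = cls
            return cls
        return decorator

    def get(self, name: str) -> Optional[type]:
        """Look up a helper class; returns None for unknown architectures."""
        return self._entries.get(name)


MODELS = Registry()


class BaseHelper:
    """Default helper; the pattern below is an assumed example value."""

    @classmethod
    def skipped_modules(cls) -> List[str]:
        return ['lm_head']


@MODELS.register('Qwen3MoeForCausalLM')
class Qwen3MoeHelper(BaseHelper):
    """Per-architecture helper adding an assumed MoE router pattern."""

    @classmethod
    def skipped_modules(cls) -> List[str]:
        return super().skipped_modules() + ['mlp.gate']


def build_quantization_config(arch: str) -> dict:
    """Propagate the per-arch skip list into a quantization config dict."""
    helper = MODELS.get(arch) or BaseHelper
    return {'modules_to_not_convert': helper.skipped_modules()}
```

With this shape, an unregistered architecture falls back to the base skip list instead of raising, which is one possible way to keep older models working.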

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `lmdeploy/vl/model/qwen3_5.py` | Adds build_model() handling for Qwen3.5 and InternS2Preview VLM variants. |
| `lmdeploy/lite/utils/batch_split.py` | Adjusts splitting logic for Qwen3.5 position_embeddings tuple layout. |
| `lmdeploy/lite/quantization/awq.py` | Adds new skip-pattern plumbing and changes skip logic; extends layernorm mapping for Qwen3 MoE. |
| `lmdeploy/lite/model/base.py` | Introduces MODELS registry base helper for model-specific quantization support. |
| `lmdeploy/lite/model/qwen.py` | Registers Qwen3/Qwen3.5/InternS2Preview skip patterns and MoE conversion helper. |
| `lmdeploy/lite/model/mixtral.py` | Registers Mixtral helper and version-dependent skip patterns. |
| `lmdeploy/lite/model/__init__.py` | Initializes Lite model registry and imports registered helpers. |
| `lmdeploy/lite/apis/smooth_quant.py` | Threads trust_remote_code, consumes new calibrate return shape, and writes modules_to_not_convert. |
| `lmdeploy/lite/apis/calibrate.py` | Refactors model/tokenizer loading, expands supported model maps, and returns arch. |
| `lmdeploy/lite/apis/auto_awq.py` | Adds calib_samples=0 data-free mode and uses per-arch helpers/skip list propagation. |
| `lmdeploy/cli/utils.py` | Updates CLI help text to document --calib-samples 0. |
| `lmdeploy/archs.py` | Removes workspace (TurboMind converted model) shortcut from get_task(). |
Comments suppressed due to low confidence (1)

lmdeploy/archs.py:146

  • get_task() no longer handles local TurboMind converted/workspace model directories (typically containing triton_models/weights). Without this short-circuit, calling get_task() on a converted TurboMind model path will fall through to get_model_arch() and likely fail because there is no HF config to load. Please restore the workspace detection (or add equivalent handling in get_model_arch()) so converted TurboMind models continue to be recognized correctly.
def get_task(backend: str, model_path: str, trust_remote_code: bool = False):
    """Get pipeline type and pipeline class from model config."""
    from lmdeploy.serve.core import AsyncEngine

    _, config = get_model_arch(model_path, trust_remote_code=trust_remote_code)
    if check_vl_llm(backend, config.to_dict()):
        from lmdeploy.serve.core import VLAsyncEngine
        return 'vlm', VLAsyncEngine

    # default task, pipeline_class
    return 'llm', AsyncEngine


Comment on lines 19 to +23
'Qwen2ForCausalLM': 'Qwen2DecoderLayer',
'Qwen3ForCausalLM': 'Qwen3DecoderLayer',
'Qwen3MoeForCausalLM': 'Qwen3MoeDecoderLayer',
'Qwen3_5ForConditionalGeneration': 'Qwen3_5DecoderLayer',
'Qwen3_5MoeForConditionalGeneration': 'Qwen3_5MoeDecoderLayer',
'LlavaLlamaForCausalLM': 'LlamaDecoderLayer',
'MGMLlamaForCausalLM': 'LlamaDecoderLayer', # mini gemini
'InternLMXComposer2ForCausalLM': 'InternLM2DecoderLayer',
'InternS2PreviewForConditionalGeneration': 'InternS2PreviewDecoderLayer',
'Qwen3MoeDecoderLayer': {
'input_layernorm': ['self_attn.k_proj', 'self_attn.q_proj', 'self_attn.v_proj'],
'post_attention_layernorm': ['mlp.gate_proj', 'mlp.up_proj']
},
Comment on lines +137 to +141
"""

patterns.extend(SKIPPED_MODULE)

def skipped_module(name: str):
"""Whether the module should be skipped from quantization."""
for m in SKIPPED_MODULE:
if m in name:
return True
return False
return next(((True, pattern) for pattern in patterns if pattern in name), (False, None))
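
The changed return shape in this hunk can be tried in isolation. The sketch below copies the `next(...)` expression from the diff but passes `patterns` as an explicit argument so it runs standalone; the example patterns and module names are assumptions.

```python
from typing import List, Optional, Tuple


def skipped_module(name: str, patterns: List[str]) -> Tuple[bool, Optional[str]]:
    """Return (True, pattern) for the first pattern found in name, else (False, None)."""
    return next(((True, pattern) for pattern in patterns if pattern in name), (False, None))


patterns = ['lm_head', 'mlp.gate']  # assumed example patterns

hit = skipped_module('model.layers.0.mlp.gate', patterns)
miss = skipped_module('model.layers.0.self_attn.q_proj', patterns)
# Substring matching also catches longer names that contain the pattern.
also_hit = skipped_module('model.layers.0.mlp.gate_proj', patterns)
```

Two things follow from this shape. A non-empty tuple is always truthy in Python, so any call site still written as `if skipped_module(name):` would take the branch even for `(False, None)` and needs to unpack the result instead. Also, because matching is by substring, a pattern such as `mlp.gate` would match `mlp.gate_proj` as well.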
Comment thread lmdeploy/lite/apis/calibrate.py Outdated
Comment on lines +183 to +186
def get_task(backend: str, model_path: str):
    """Get pipeline type and pipeline class from model config."""

    _, config = get_model_arch(model_path)
Comment thread lmdeploy/lite/apis/smooth_quant.py Outdated
torch.cuda.empty_cache()
patterns = []
skipped_modules = []
arch = model.config.architectures[0]

@classmethod
def skipped_modules(cls):
pass
@43758726 43758726 marked this pull request as draft May 25, 2026 12:27
@43758726 43758726 marked this pull request as ready for review May 26, 2026 11:49
dtype = TORCH_DTYPE_TO_STR[torch_dtype]
configs = [model.config]

for name in ['text_config', 'llm_config', 'vision_config', 'ts_config']:
Collaborator

Any special reason to check vision_config and ts_config?
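
For context, the loop in question appears to gather the top-level config plus any nested sub-configs before applying a setting to each. A rough standalone sketch, assuming the sub-configs are read with `getattr` and that the per-config action is a dtype assignment (neither is shown on this page), using `SimpleNamespace` in place of real config objects:

```python
from types import SimpleNamespace
from typing import Any, List

# Attribute names taken from the loop under review.
SUB_CONFIG_NAMES = ['text_config', 'llm_config', 'vision_config', 'ts_config']


def collect_configs(config: Any) -> List[Any]:
    """Return the config followed by every sub-config that is present."""
    configs = [config]
    for name in SUB_CONFIG_NAMES:
        sub_config = getattr(config, name, None)  # assumed lookup
        if sub_config is not None:
            configs.append(sub_config)
    return configs


def set_config_dtype(config: Any, dtype: str) -> None:
    """Assumed per-config action: write the same dtype everywhere."""
    for cfg in collect_configs(config):
        cfg.torch_dtype = dtype


# Stand-in for a VLM config with text and vision sub-configs.
vlm_config = SimpleNamespace(
    torch_dtype='float32',
    text_config=SimpleNamespace(torch_dtype='float32'),
    vision_config=SimpleNamespace(torch_dtype='float32'),
)
set_config_dtype(vlm_config, 'bfloat16')
```

Under these assumptions, including `vision_config` means the vision tower's recorded dtype changes together with the language model's, which is the behavior the question asks about.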


patterns = []
skipped_modules = []
arch = model.config.architectures[0]
Collaborator

Doesn't "MODELS.get(arch)" fail if arch = model.config.architectures[0] is removed?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +128 to +134
dtype = 'bfloat16'

if isinstance(dtype, torch.dtype):
return dtype
return STR_TO_TORCH_DTYPE[dtype]


if sub_config is not None:
configs.append(sub_config)

for config in configs:
Comment on lines +338 to +343
_set_use_cache(model)
torch_dtype = _get_torch_dtype(original_config)
_set_config_dtype(model, torch_dtype)
if dtype == 'float16' or (dtype == 'auto' and torch_dtype == torch.float16):
model.half()
elif dtype == 'bfloat16' or (dtype == 'auto' and original_config.torch_dtype == torch.bfloat16):
elif dtype == 'bfloat16' or (dtype == 'auto' and torch_dtype == torch.bfloat16):